Abstract

Moving beyond the dualistic view in AI where agent and environment 
are separated incurs new challenges for decision making, as calculation of 
expected utility is no longer straightforward. The non-dualistic decision 
theory literature is split between causal decision theory and evidential 
decision theory. We extend these decision algorithms to the sequential 
setting where the agent alternates between taking actions and observing
their consequences. We find that evidential decision theory has two
natural extensions while causal decision theory only has one. 

Keywords. Evidential decision theory, causal decision theory, causal graphical
models, planning, dualism, physicalism.


1 Introduction 

In artificial-intelligence problems an agent interacts sequentially with an
environment by taking actions and receiving percepts [RN10]. This model is
dualistic: the agent is distinct from the environment. It influences the environment
only through its actions, and the environment has no other information about
the agent. The dualism assumption is accurate for an algorithm that is playing
chess, go, or other (video) games, which explains why it is ubiquitous in
AI research. But often it is not true: real-world agents are embedded in (and
computed by) the environment [OR12], and then a physicalistic model^1 is more
appropriate.

This distinction becomes relevant in multi-agent settings with similar agents, 
where each agent encounters ‘echoes’ of its own decision making. If the other 
agents are running the same source code, then the agents’ decisions are logically 
connected. This link can be used for uncoordinated cooperation [LFY+14].
Moreover, a physicalistic model is indispensable for self-reflection. If the agent is 
required to autonomously verify its integrity, and perform maintenance, repair, 
or upgrades, then the agent needs to be aware of its own functioning. For this, a 
reliable and accurate self-modeling is essential. Today, applications of this level 
of autonomy are mostly restricted to space probes distant from earth or robots 
navigating lethal situations, but in the future this might also become crucial for 

*The final publication is available at http://link.springer.com/

^1 Some authors also call this type of model materialistic or naturalistic.

Figure 1: The physicalistic model. The hidden state $s$ contains information
about the agent that is unknown to it. The distribution $\mu$ is the agent’s
(subjective) environment model, and $\pi$ its (deterministic) policy. The agent models
itself through the beliefs about (future) actions given by its environment model
$\mu$. Interaction with the environment at time step $t$ occurs through an action $a_t$
chosen by the agent and a percept $e_t$ returned by the environment.


sustained self-improvement in generally intelligent agents [Yud08, Bos14, SF14a,
RDT+15].

In the physicalistic model the agent is embedded inside the environment, 
as depicted in Figure 1. The environment has a hidden state that contains
information about the agent that is inaccessible to the agent itself. The agent 
has an environment model that describes the behavior of the environment given 
the hidden state and includes beliefs about the agent’s own future actions (thus 
modeling itself). 

Physicalistic agents may view their actions in two ways: as their selected 
output, and as consequences of properties of the environment. This leads to 
significantly more complex problems of inference and decision making, with 
actions simultaneously being both means to influence the environment and
evidence about it. For example, looking at cat pictures online may simultaneously
be a means of procrastination, and evidence of bad air quality in the room. 

Dualistic decision making in a known environment is straightforward
calculation of expected utilities. This is known as Savage decision theory [Sav72]. For
non-dualistic decision making two main approaches are offered by the decision
theory literature: causal decision theory (CDT) [GH78, Lew81, Sky82, Joy99,
Wei12] and evidential decision theory (EDT) [Jef83, Bri14, Ahm14]. EDT and
CDT both take actions that maximize expected utility, but differ in the way this
expectation is computed: EDT uses the action under consideration as evidence
about the environment while CDT does not. Section 2 formally introduces these
decision algorithms.

Our contribution is to formalize and explore a decision-theoretic setting with 
a physicalistic reinforcement learning agent interacting sequentially with an
environment that it is embedded in (Section 3). Previous work on non-dualistic
decision theories has focused on one-shot situations. We find that there are 
two natural extensions of EDT to the sequential case, depending on whether 
the agent updates beliefs based on its next action or its entire policy. CDT 
has only one natural extension. We extend two famous Newcomblike problems 
to the sequential setting to illustrate the differences between our (generalized) 
decision theories. 

Section 4 summarizes our results and outlines future directions. A list of
notation can be found on page 16 and Appendix A contains formal details to
our examples.


2 One-Shot Decision Making 


In a one-shot decision problem, we take one action $a \in \mathcal{A}$, receive a percept $e \in \mathcal{E}$
(typically called outcome in the decision theory literature) and get a payoff $u(e)$
according to the utility function $u : \mathcal{E} \to [0, 1]$. We assume that the set of actions
$\mathcal{A}$ and the set of percepts $\mathcal{E}$ are finite. Additionally, the environment contains
a hidden state $s \in \mathcal{S}$. The hidden state holds information that is inaccessible
to the agent at the time of the decision, but may influence the decision and the
percept. Formally, the environment is given by a probability distribution $P$ over
the hidden state, the action, and the percept that factors according to a causal
graph [Pea09].

A causal graph over the random variables $x_1, \dots, x_n$ is a directed acyclic
graph with nodes $x_1, \dots, x_n$. To each node $x_i$ belongs a probability distribution
$P(x_i \mid pa_i)$, where $pa_i$ is the set of parents of $x_i$ in the graph. It is
natural to identify the causal graph with the factored distribution
$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa_i)$. Given such a causal graph/factored
distribution, we define the do-operator as

$$P(x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n \mid do(x_j := b)) = \prod_{i=1,\, i \neq j}^{n} P(x_i \mid pa_i) \tag{1}$$

where $x_j$ is set to $b$ wherever it occurs in $pa_i$, $1 \leq i \leq n$. The result is a
new probability distribution that can be marginalized and conditioned in the
standard way. Intuitively, intervening on node $x_j$ means ignoring all incoming
arrows to $x_j$, as the effects they represent are no longer relevant when we
intervene; the factor $P(x_j \mid pa_j)$ representing the ingoing influences to $x_j$ is therefore
removed in the right-hand side of (1). Note that the do-operator is only defined
for distributions for which a causal graph has been specified. See [Pea09, Ch.
3.4] for details.
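
As an illustration of (1), the following Python sketch computes the post-intervention distribution by truncated factorization for a small discrete causal graph. The dictionary-based representation, the function name `do`, and all numbers in the example graph are assumptions made for this illustration; they are not taken from the paper.

```python
from itertools import product

def do(nodes, values, parents, cpt, j, b):
    """Post-intervention distribution P(. | do(x_j := b)) via equation (1).

    nodes   : list of node names
    values  : dict node -> list of possible values
    parents : dict node -> tuple of parent names
    cpt     : dict node -> dict (tuple of parent values) -> dict value -> prob
    Returns a dict mapping full assignments (tuples ordered like `nodes`,
    with x_j fixed to b) to probabilities.
    """
    others = [n for n in nodes if n != j]
    dist = {}
    for combo in product(*(values[n] for n in others)):
        x = dict(zip(others, combo))
        x[j] = b  # x_j is set to b wherever it occurs as a parent
        p = 1.0
        for n in others:  # the factor P(x_j | pa_j) is dropped
            pa = tuple(x[q] for q in parents[n])
            p *= cpt[n][pa][x[n]]
        dist[tuple(x[n] for n in nodes)] = p
    return dist

# Hypothetical graph with edges s -> a, s -> e, a -> e (numbers are made up).
nodes = ["s", "a", "e"]
values = {"s": [0, 1], "a": [0, 1], "e": [0, 1]}
parents = {"s": (), "a": ("s",), "e": ("s", "a")}
cpt = {
    "s": {(): {0: 0.3, 1: 0.7}},
    "a": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    "e": {(s, a): {0: 0.5 - 0.1 * s + 0.2 * a, 1: 0.5 + 0.1 * s - 0.2 * a}
          for s in (0, 1) for a in (0, 1)},
}
post = do(nodes, values, parents, cpt, "a", 1)
```

In this toy graph the marginal of `s` under `do(a := 1)` stays at its prior, because the factor linking `s` to `a` has been removed; conditioning on `a = 1` instead would shift the belief about `s`.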

2.1 Savage Decision Theory 

In the dualistic formulation of decision theory, we have a function $P$ that takes
an action $a$ and returns a probability distribution $P_a$ over percepts. Savage
decision theory (SDT) [Sav72, Bri14] takes actions according to

$$\arg\max_{a \in \mathcal{A}} \sum_{e \in \mathcal{E}} P_a(e)\, u(e) \tag{SDT}$$

In the dualistic model it is usually conceptually clear what $P_a$ should be.
In the physicalistic model the environment model takes the form of a causal
graph over a hidden state $s$, action $a$, and percept $e$, as illustrated in Figure 2.
According to this causal graph, the probability distribution $P$ factors causally
into $P(s, a, e) = P(s) P(a \mid s) P(e \mid s, a)$. The hidden state is not independent of
the decision maker’s action and Savage’s model is not directly applicable since
we do not have a specification of $P_a$. How should decisions be made in this
context? The literature focuses on two answers to this question: CDT and
EDT.
Figure 2: The causal graph $P(s, a, e) = P(s) P(a \mid s) P(e \mid s, a)$ for one-step
decision making. The hidden state $s$ influences both the decision maker’s action
$a$ and the received percept $e$.


2.2 Causal and Evidential Decision Theory 

The literature on causal and evidential decision theory is vast, and we give only a 
very superficial overview that is intended to bring the reader up to speed on the 
basics. See [Bri14, Wei12] and references therein for more detailed introductions.

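Evidential decision theory (endorsed in [Jef83, Ahm14]) considers the
probability of the percept $e$ conditional on taking the action $a$:

$$\arg\max_{a \in \mathcal{A}} \sum_{e \in \mathcal{E}} P(e \mid a)\, u(e) \quad \text{with} \quad P(e \mid a) = \sum_{s \in \mathcal{S}} P(e \mid s, a) P(s \mid a) \tag{EDT}$$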


Causal decision theory has several formulations [GH78, Lew81, Sky82, Joy99];
we use the one given in [Sky82], with Pearl’s calculus of causality [Pea09].
According to CDT, the probability of a percept $e$ is given by the causal intervention
of performing action $a$ on the causal graph from Figure 2:

$$\arg\max_{a \in \mathcal{A}} \sum_{e \in \mathcal{E}} P(e \mid do(a))\, u(e) \quad \text{with} \quad P(e \mid do(a)) = \sum_{s \in \mathcal{S}} P(e \mid s, a) P(s) \tag{CDT}$$

where $P(e \mid do(a))$ follows from (1) and marginalization over $s$.

The difference between CDT and EDT is how the action affects the belief
about the hidden state. EDT assigns credence $P(s \mid a)$ to the hidden state $s$ if
action $a$ is taken, while CDT assigns credence $P(s)$. A common argument for
CDT is that an action under my direct control should not influence my belief
about things that are not causally affected by the action. Hence $P(s)$ should be
my belief in $s$, and not $P(s \mid a)$. (By assumption, the action does not causally
affect the hidden state.) EDT might reply that if action $a$ does not have the
same likelihood under all hidden states $s$, then action $a$ should indeed inform
me about the hidden state, regardless of causal connection. The following two
classical examples from the decision theory literature describe situations where
CDT and EDT disagree. A formal definition of these examples can be found in
Appendix A.

Example 1 (Newcomb’s Problem [Noz69]). In Newcomb’s Problem there are
two boxes: an opaque box that is either empty or contains one million dollars 
and a transparent box that contains one thousand dollars. The agent can choose 
between taking only the opaque box (‘one-boxing’) and taking both boxes (‘two- 
boxing’). The content of the opaque box is determined by a prediction about 
the agent’s action by a very reliable predictor: if the agent is predicted to
one-box, the box contains the million, and if the agent is predicted to two-box, the
box is empty. In Newcomb’s problem EDT prescribes one-boxing because
one-boxing is evidence that the box contains a million dollars. In contrast, CDT
prescribes two-boxing because two-boxing dominates one-boxing: in either case
we are a thousand dollars richer, and our decision cannot causally affect the
prediction. Newcomb’s problem has been raised as a critique to CDT, but
many philosophers insist that two-boxing is in fact the rational choice,^2 even if
it means you end up poor.

Note how the decision depends on whether the action influences the belief 
about the hidden state (the contents of the opaque box) or not. 
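
The disagreement can be checked numerically with a short Python sketch of (EDT) and (CDT) on the one-shot causal graph. The predictor accuracy of 99%, the uniform prior over predictions, and the function and variable names are assumptions chosen for illustration; the paper's own formalization is the one in Appendix A.

```python
def edt_value(a, P_s, P_a_given_s, P_e_given_sa, u):
    """Expected utility under EDT: hidden states are weighted by P(s | a)."""
    norm = sum(P_s[s] * P_a_given_s[s][a] for s in P_s)
    total = 0.0
    for s in P_s:
        p_s_given_a = P_s[s] * P_a_given_s[s][a] / norm  # Bayes' rule
        total += p_s_given_a * sum(p * u[e] for e, p in P_e_given_sa[(s, a)].items())
    return total

def cdt_value(a, P_s, P_e_given_sa, u):
    """Expected utility under CDT: hidden states are weighted by the prior P(s)."""
    return sum(P_s[s] * sum(p * u[e] for e, p in P_e_given_sa[(s, a)].items())
               for s in P_s)

# Assumed model: the hidden state is the prediction, which is 99% accurate.
P_s = {"pred_one": 0.5, "pred_two": 0.5}
P_a_given_s = {"pred_one": {"one": 0.99, "two": 0.01},
               "pred_two": {"one": 0.01, "two": 0.99}}
# Percepts are payoffs in dollars, deterministic given (s, a).
P_e_given_sa = {("pred_one", "one"): {1_000_000: 1.0},
                ("pred_one", "two"): {1_001_000: 1.0},
                ("pred_two", "one"): {0: 1.0},
                ("pred_two", "two"): {1_000: 1.0}}
# Utilities rescaled to [0, 1].
u = {e: e / 1_001_000 for e in (0, 1_000, 1_000_000, 1_001_000)}

actions = ["one", "two"]
edt_choice = max(actions, key=lambda a: edt_value(a, P_s, P_a_given_s, P_e_given_sa, u))
cdt_choice = max(actions, key=lambda a: cdt_value(a, P_s, P_e_given_sa, u))
```

With these numbers `edt_choice` is one-boxing and `cdt_choice` is two-boxing: the only difference between the two functions is whether the hidden state is weighted by $P(s \mid a)$ or by $P(s)$.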


Newcomb’s problem may appear as an unrealistic thought experiment.
However, we argue that problems with similar structure are fairly common. The
main structural requirement is that $P(s \mid a) \neq P(s)$ for some state or event
$s$ that is not causally affected by $a$. In Newcomb’s problem the predictor’s
ability to guess the action induces an ‘information link’ between actions and
hidden states. If the stakes are high enough, the predictor does not have to
be much better than random in order to generate a Newcomblike decision
problem. Consider for example spouses predicting the faithfulness of their partners,
employers predicting the trustworthiness of their employees, or parents
predicting their children’s intentions. For AIs, the potential for accurate predictions
is even greater, as the predictor may have access to the AI’s source code.
Although rarely perfect, all of these predictions are often substantially better than
random.

To counteract the impression that EDT is generally superior to CDT, we 
also discuss the toxoplasmosis problem. 

Example 2 (Toxoplasmosis Problem [Alt13]).^3 This problem takes place in a
world in which there is a certain parasite that causes its hosts to be attracted to 
cats, in addition to uncomfortable side effects. The agent is handed an adorable 
little kitten and is faced with the decision of whether or not to pet it. Petting the 
kitten feels nice and therefore yields more utility than not petting it. However, 
people suffering from the parasite are more likely to pet the kitten. Petting the 
kitten is evidence of having the parasite, so EDT recommends against it. CDT 
correctly observes that petting the kitten does not cause the parasite, and is 
therefore in favor of petting. 


Newcomb’s problem and the toxoplasmosis problem cannot be properly
formalized in SDT, because SDT requires the percept-probabilities $P_a$ to be
specified, but it is not clear what the right choice of $P_a$ would be. However, both CDT
and EDT can be recast in the context of (SDT) by setting $P_a$ to be $P(\,\cdot \mid do(a))$
and $P(\,\cdot \mid a)$ respectively. Thus we could say that the formulation given by
Savage needs a specification of the environment that tells us whether to act
evidentially, causally, or otherwise.


^2 In a 2009 survey, 31.4% of philosophers favored two-boxing, and 21.3% favored one-boxing
(931 responses); see http://philpapers.org/surveys/results.pl. Is that the reason there
are so few wealthy philosophers?

^3 Historically, this problem has been known as the smoking lesion problem [Ega07]. We
consider the smoking lesion formulation confusing, because today it is universally known that
smoking does cause lung cancer.
Figure 3: The (infinite) causal graph for a sequential environment. Each action
$a_t$ and each percept $e_t$ is represented by a node in the causal graph. Actions
and percepts affect all subsequent actions and percepts: causality follows time.
The hidden state $s$ is only ever indirectly (partially) observed.


3 Sequential Decision Making 


In this section we extend CDT and EDT to the sequential case. We start by
formally specifying the physicalistic model depicted in Figure 1 in the first
subsection, and discuss problems with time consistency in Section 3.2 before
defining the extensions proper in Sections 3.3 and 3.4. The final subsection dissects
the role of the hidden state.


3.1 The Physicalistic Model 

For the remainder of this paper, we assume that the agent interacts sequentially
with an environment. At time step $t$ the agent chooses an action $a_t \in \mathcal{A}$ and
receives a percept $e_t \in \mathcal{E}$ which yields a utility of $u(e_t) \in \mathbb{R}$; the cycle then
repeats for $t + 1$. A history is an element of $(\mathcal{A} \times \mathcal{E})^*$. We use $æ \in \mathcal{A} \times \mathcal{E}$ to
denote one interaction cycle, and $æ_{<t}$ to denote a history of length $t - 1$. The
percepts between time $t$ and time $m$ are denoted $e_{t:m}$. A policy is a function
that maps a history $æ_{<t}$ to the next action $a_t$. We only consider deterministic
policies.

We assume that the agent is given an environment model $\mu$, but knows
neither the hidden state $s$ nor its own future actions. The unknown hidden
state may influence both percepts and actions. Actions and percepts in turn
influence the entire future. The environment model $\mu$ is given by a probability
distribution over hidden states and histories that factors as

$$\mu(s, æ_{<t}) = \mu(s) \prod_{i=1}^{t-1} \mu(a_i \mid s, æ_{<i})\, \mu(e_i \mid s, æ_{<i} a_i) \tag{2}$$

for any $t \in \mathbb{N}$. While such a factorization is possible for any distribution, we
additionally demand that this factorization is causal according to the causal graph
in Figure 3. The distribution $\mu(a_t \mid s, æ_{<t})$ gives the likelihood of the agent’s
own actions provided a hidden state $s \in \mathcal{S}$ (for example, the prior probability of
an infected agent petting the kitten in the toxoplasmosis problem above). For
technical reasons, this distribution must always leave some uncertainty about
the actions: if the environment model assigned probability zero for an action $a'$,
the agent could not deliberate taking action $a'$ since $a'$ could not be conditioned
on. Formally, we require $\mu(\,\cdot \mid s)$ to be action-positive for all $s \in \mathcal{S}$:

$$\forall æ_{<t} a_t \in (\mathcal{A} \times \mathcal{E})^* \times \mathcal{A}.\ \big( \mu(æ_{<t} \mid s) > 0 \implies \mu(a_t \mid s, æ_{<t}) > 0 \big) \tag{3}$$

The distribution $\mu$ is a model of the environment, a belief held by the agent,
but not the distribution from which the actual history is drawn. The actual
history is distributed according to the true environment distribution. Because
the environment contains the agent, the agent’s algorithm might get modified
by it and the actions that the agent actually ends up taking might not be the
actions that were planned. In the end, model and reality will disagree: for
example, we simultaneously assume the agent’s policy $\pi$ to be deterministic
and the environment model to be action-positive. Nevertheless, we assume the
given environment model is accurate in the sense that it faithfully represents
the environment in the ways relevant to the agent. In other words, we are
interested in problems that arise during planning, not problems that arise due
to poor modeling.

3.2 Time Consistency 

When planning for the infinite future we need to make sure that utilities do not
sum to infinity; typically this is achieved with discounting. Here, we simplify
by fixing a finite $m \in \mathbb{N}$ to be the agent’s lifetime: the agent cares about the
sum of the utilities of all percepts $e_1, \dots, e_m$ until and including time step $m$, but
does not care what happens after that (presumably the agent is then retired).

In sequential decision theory we need to plan the next $m - t$ actions in
time step $t$. We plan what we would do for all possible future percepts $e_{t:m}$ by
choosing a policy $\pi : (\mathcal{A} \times \mathcal{E})^* \to \mathcal{A}$ that specifies which action we take depending
on how the history plays out. For example, we take action $a_t$, and when we
subsequently receive the percept $e_t$, we plan to take action $a_{t+1}$. Problems arise
once we get to the next step and even though we did take action $a_t$ and the
percept did turn out to be $e_t$, we change our mind and take a different action
$a'_{t+1}$. This is called time inconsistency. Time inconsistency is an artifact of bad
planning since the agent incorrectly anticipates her own actions. The choice of
discounting can lead to time inconsistency: a sliding fixed-size horizon is time
inconsistent, but a fixed finite lifetime is time consistent [LH14].

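We achieve time consistency by using a fixed finite lifetime, and by calculating
decisions recursively using value functions. A value function $V^{\pi}_{\mu,m}$ is a
function of type $((\mathcal{A} \times \mathcal{E})^* \cup ((\mathcal{A} \times \mathcal{E})^* \times \mathcal{A})) \to \mathbb{R}$. It gives an estimate of future reward:
$V^{\pi}_{\mu,m}(æ_{<t})$ and $V^{\pi}_{\mu,m}(æ_{<t} a_t)$ are estimates of how much reward the policy $\pi$ will
obtain in environment $\mu$ within lifetime $m$ subsequent to history $æ_{<t}$ and $æ_{<t} a_t$
respectively. For any history $æ_{<t}$, we define $V^{\pi}_{\mu,m}(æ_{<t}) := V^{\pi}_{\mu,m}(æ_{<t} \pi(æ_{<t}))$.

We say that a policy $\pi$ is optimal and time consistent for the value function
$V^{\pi}_{\mu,m}$ iff $\pi(æ_{<t}) = \arg\max_{a} V^{\pi}_{\mu,m}(æ_{<t} a)$ for all histories $æ_{<t} \in (\mathcal{A} \times \mathcal{E})^*$ and
all $t \leq m$.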

3.3 Sequential Evidential Decision Theory 

Evidential decision theory assigns probability $P(e \mid a)$ to action $a$ resulting in
percept $e$ (Section 2.2). There are two ways to generalize this to the sequential
setting, depending on whether we use only the next action or the whole future
policy as evidence for the next percept.
Definition 3 (Action-Evidential Decision Theory). The action-evidential value
of a policy $\pi$ with lifetime $m$ in environment $\mu$ given history $æ_{<t} a_t$ is

$$V^{aev,\pi}_{\mu,m}(æ_{<t} a_t) := \sum_{e_t} \mu(e_t \mid æ_{<t} a_t) \left( u(e_t) + V^{aev,\pi}_{\mu,m}(æ_{<t} a_t e_t) \right) \tag{SAEDT}$$

and $V^{aev,\pi}_{\mu,m}(æ_{<t} a_t) := 0$ for $t > m$. Sequential Action-Evidential Decision
Theory (SAEDT) prescribes adopting an optimal and time consistent policy $\pi$ for
$V^{aev,\pi}_{\mu,m}$.

It may be argued that SAEDT does not take all available (deliberative)
information into account. When considering the consequences of an action,
future developments of the environment-policy interactions could also be used
as evidence. That is, we could condition not only on the next action, but on
the future policy as a whole (within the lifetime). In order to define conditional
probabilities with respect to (deterministic) policies, we define the following
events. For a given policy $\pi$, let $\Pi_{t:m}$ be the set of all strings consistent with $\pi$
between time step $t$ and $m$:

$$\Pi_{t:m} := \{ æ_{1:\infty} \mid \forall t \leq i \leq m.\ \pi(æ_{<i}) = a_i \}$$

The likelihood of a next percept $e_t$ provided a history $æ_{<t}$ and a (future) policy
$\pi$ followed from time step $t$ until lifetime $m$ (denoted $\pi_{t:m}$) is then defined as

$$\mu(e_t \mid æ_{<t}, \pi_{t:m}) := \mu(e_t \mid æ_{<t} \cap \Pi_{t:m}). \tag{4}$$

This is an atemporal conditional because we are conditioning on future actions
up until the end of the agent’s lifetime. The conditional (4) is well-defined
because we only take the actions from time step $t$ to $m$ into account; conditioning
on policies with infinite lifetime leads to technical problems because such policies
typically have $\mu$-measure zero.

Definition 4 (Policy-Evidential Decision Theory). The policy-evidential value
of a policy $\pi$ with lifetime $m$ in environment $\mu$ given history $æ_{<t} a_t$ is

$$V^{pev,\pi}_{\mu,m}(æ_{<t} a_t) := \sum_{e_t} \mu(e_t \mid æ_{<t} a_t, \pi_{t+1:m}) \left( u(e_t) + V^{pev,\pi}_{\mu,m}(æ_{<t} a_t e_t) \right) \tag{SPEDT}$$

and $V^{pev,\pi}_{\mu,m}(æ_{<t} a_t) := 0$ for $t > m$. Sequential Policy-Evidential Decision Theory
(SPEDT) prescribes adopting an optimal and time consistent policy $\pi$ for
$V^{pev,\pi}_{\mu,m}$.

For one-step decisions ($m = t + 1$), SAEDT and SPEDT coincide.
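
To make the difference between the two conditionals concrete, the following Python sketch computes next-percept beliefs by brute-force enumeration of a small environment model that factors as in (2). The toy model (a hidden state that makes both a symptom and petting more likely), the helper names `joint` and `next_percept_belief`, and the parameter `last` (the final time step whose action is conditioned on) are assumptions made for this illustration. The sketch also assumes that the policy event constrains the actions at time steps $t$ through `last` inclusive.

```python
def joint(p_s, action_prob, percept_prob, m):
    """Enumerate mu(s, history of length m) using the causal factorization (2)."""
    out = []

    def extend(s, hist, p):
        if len(hist) == m:
            out.append((s, hist, p))
            return
        for a, pa in action_prob(s, hist).items():
            for e, pe in percept_prob(s, hist, a).items():
                extend(s, hist + ((a, e),), p * pa * pe)

    for s, ps in p_s.items():
        extend(s, (), ps)
    return out

def next_percept_belief(joint_dist, prefix, policy, last):
    """Belief about e_t given the history `prefix` (t = len(prefix) + 1) and
    given that the actions a_t, ..., a_last all follow `policy`."""
    t = len(prefix) + 1
    weights = {}
    for s, hist, p in joint_dist:
        if hist[:t - 1] != prefix:
            continue
        if any(hist[i - 1][0] != policy(hist[:i - 1]) for i in range(t, last + 1)):
            continue
        e_t = hist[t - 1][1]
        weights[e_t] = weights.get(e_t, 0.0) + p
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

# Hypothetical model: being sick makes symptoms and petting more likely.
p_s = {"sick": 0.5, "healthy": 0.5}

def action_prob(s, hist):
    return {"pet": 0.8, "skip": 0.2} if s == "sick" else {"pet": 0.2, "skip": 0.8}

def percept_prob(s, hist, a):
    return {"symptom": 0.8, "fine": 0.2} if s == "sick" else {"symptom": 0.2, "fine": 0.8}

def never_pet(hist):
    return "skip"

mu = joint(p_s, action_prob, percept_prob, m=2)
action_ev = next_percept_belief(mu, (), never_pet, last=1)  # evidence: a_1 only
policy_ev = next_percept_belief(mu, (), never_pet, last=2)  # evidence: a_1 and a_2
```

In this toy model the action-evidential belief in a symptom at the first step is 0.32, while conditioning on the whole two-step policy of never petting lowers it to 4/17 (about 0.235): the planned future action counts as additional evidence against being sick.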

To all our embedded agents, past actions constitute evidence about the hid¬ 
den state. For evidential agents, this principle is extended to future actions. 
SAEDT and SPEDT differ in how far they extend it. The action-evidential 
agent only updates his belief on the action about to take place. In that sense, 
he only updates his belief about the next percept on events taking place before 
this percept. The policy-evidential agent takes the principle much further, using 
“thought-experiments” of what action he would take in hypothetical situations, 
most of which will never be realized. This is illustrated in the next example. 

Example 5 (Sequential Toxoplasmosis). In our sequential variation of the
toxoplasmosis problem the agent has some probability of encountering a
kitten. Additionally, the agent has the option of seeing a doctor (for a fee) and
Figure 4: One formalization of the sequential toxoplasmosis problem. Dashed
lines connect states indistinguishable to the agent. The numbers on the edges
indicate probabilities of the environment model $\mu$, and the numbers in
parentheses indicate utilities of the associated percepts. In the first step, the environment
selects the hidden state that is unknown to the agent. The agent then decides
whether to go to the doctor. If he does not go, he may encounter a kitten which
he can choose to pet or not. SAEDT and SPEDT will disagree whether going
to the doctor is the best option in this scenario. Appendix A contains the full
calculations.


getting tested for the parasite, which can then be safely removed. In the very
beginning, an SPEDT agent updates his belief on the fact that if he
encountered a kitten, he would not pet it, which lowers the probability that he has
the parasite and makes seeing the doctor unattractive. An SAEDT agent only
updates his belief about the parasite when he actually encounters a kitten, and
thus prefers seeing the doctor. See Figure 4 for more details and a graphical
illustration.

The observant reader may ask whether SPEDT could be enticed to make 
some percepts unlikely by choosing improbable actions subsequent to them. 
For example, could an SPEDT agent decide on a policy of selecting highly 
improbable actions in case it rained to make histories with rain less likely? The 
answer is no, as most such policies would not be time consistent. If it does rain,
the highly improbable action would usually not be the best one, and so the policy
would not be prescribed by Definition 4.

3.4 Sequential Causal Decision Theory 

In sequential causal decision theory we ask what would happen if we causally
intervened on the node $a_t$ of the next action and fix it to $\pi(æ_{<t})$ according to
the policy $\pi$. This is expressed by the notation $do(a_t := \pi(æ_{<t}))$, or $do(\pi(æ_{<t}))$
for short.

Definition 6 (Sequential Causal Decision Theory). The causal value of a policy
$\pi$ with lifetime $m$ in environment $\mu$ given history $æ_{<t} a_t$ is

$$V^{cau,\pi}_{\mu,m}(æ_{<t} a_t) := \sum_{e_t} \mu(e_t \mid æ_{<t}, do(a_t)) \left( u(e_t) + V^{cau,\pi}_{\mu,m}(æ_{<t} a_t e_t) \right) \tag{SCDT}$$

and $V^{cau,\pi}_{\mu,m}(æ_{<t} a_t) := 0$ for $t > m$. Sequential Causal Decision Theory (SCDT)
prescribes adopting an optimal and time consistent policy $\pi$ for $V^{cau,\pi}_{\mu,m}$.

For sequential evidential decision theory we discussed two versions, SAEDT
and SPEDT, based on next action and future policy respectively. In SCDT
we perform the causal intervention $do(a_t := \pi(æ_{<t}))$. We could also consider
a policy-causal decision theory by replacing $\mu(e_t \mid æ_{<t}, do(a_t))$ with
$\mu(e_t \mid æ_{<t}, do(\pi_{t:m}))$ in Definition 6. The causal intervention $do(\pi_{t:m})$ of a policy
$\pi$ between time step $t$ and time step $m$ is defined as

$$\mu(e_t \mid æ_{<t}, do(\pi_{t:m})) := \sum_{e_{t+1:m}} \mu(e_{t:m} \mid æ_{<t}, do(a_t := \pi(æ_{<t}), \dots, a_m := \pi(æ_{<m}))). \tag{5}$$

However, since the interventions are causal, we do not get any extra evidence
from the future interventions. Therefore policy-causal decision theory is the
same as action-causal decision theory:

Proposition 7 (Policy-Causal = Action-Causal). For all histories $æ_{<t} \in (\mathcal{A} \times \mathcal{E})^*$
and all $e_t \in \mathcal{E}$, we have $\mu(e_t \mid æ_{<t}, do(\pi_{t:m})) = \mu(e_t \mid æ_{<t}, do(\pi(æ_{<t})))$.


We defer the proof to the end of this section. The following two
examples illustrate the difference between SCDT and SAEDT/SPEDT in sequential
settings.


Example 8 (Newcomb with Precommitment). In this variation of Newcomb’s problem
the agent first has the option to pay $300,000 to sign a contract that binds the
agent to pay $2000 in case of two-boxing. An SAEDT or SPEDT agent knows
that he will one-box anyway and hence has no need for the contract. An SCDT
agent knows that she favors two-boxing, but signs the contract only if this
occurs before the prediction is made (so it has a chance of causally affecting the
prediction). With the contract in place, one-boxing is the dominant action, and
thus the SCDT agent is predicted to one-box.


Example 9 (Newcomb with Looking). In this variation of Newcomb’s problem
the agent may look into the opaque box before making the decision which box
to take. An SCDT agent is indifferent towards looking because she will take
both boxes anyway. However, an SAEDT or SPEDT agent will avoid looking
into the box, because once the content is revealed he two-boxes.


3.5 Expansion over the Hidden State 

The difference between sequential versions of EDT and CDT is how they update
their prediction of a next percept $e_t$ (Definitions 3, 4, and 6). The following
proposition expands the different beliefs in terms of the hidden state.
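Proposition 10. For all histories $æ_{<t} a_t e_t \in (\mathcal{A} \times \mathcal{E})^*$ the following holds for
the next-percept beliefs of SAEDT, SPEDT and SCDT respectively:

$$\mu(e_t \mid æ_{<t} a_t) = \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t} a_t)\, \mu(e_t \mid s, æ_{<t} a_t) \tag{6}$$

$$\mu(e_t \mid æ_{<t}, \pi_{t:m}) = \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t}, \pi_{t:m})\, \mu(e_t \mid s, æ_{<t}, \pi_{t:m}) \tag{7}$$

$$\mu(e_t \mid æ_{<t}, do(a_t)) = \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t})\, \mu(e_t \mid s, æ_{<t} a_t) \tag{8}$$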


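Proof. For the action-evidential conditional we take the joint distribution with
$s$, and then split off $e_t$:

$$\begin{aligned}
\mu(e_t \mid æ_{<t} a_t)
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t} a_t e_t)}{\mu(æ_{<t} a_t)} \\
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t} a_t)\, \mu(e_t \mid s, æ_{<t} a_t)}{\mu(æ_{<t} a_t)} \\
&= \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t} a_t)\, \mu(e_t \mid s, æ_{<t} a_t)
\end{aligned}$$

Similarly for the policy-evidential conditional:

$$\begin{aligned}
\mu(e_t \mid æ_{<t}, \pi_{t:m})
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t} \pi(æ_{<t}) e_t, \pi_{t+1:m})}{\mu(æ_{<t}, \pi_{t:m})} \\
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t} \pi(æ_{<t}), \pi_{t+1:m})\, \mu(e_t \mid s, æ_{<t} \pi(æ_{<t}), \pi_{t+1:m})}{\mu(æ_{<t}, \pi_{t:m})} \\
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t}, \pi_{t:m})\, \mu(e_t \mid s, æ_{<t} \pi(æ_{<t}), \pi_{t+1:m})}{\mu(æ_{<t}, \pi_{t:m})} \\
&= \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t}, \pi_{t:m})\, \mu(e_t \mid s, æ_{<t} \pi(æ_{<t}), \pi_{t+1:m})
\end{aligned}$$

For the causal conditional we turn to the rules of the do-operator [Pea09,
Thm. 3.4.1]. The first equality below holds by definition. In the denominator
of the second equality we can use Rule 3 (deletion of actions) to remove $do(a_t)$
because the do-operator removes all incoming edges to $a_t$ and makes $a_t$
independent of the history $æ_{<t}$. In the numerator of the second equality we use the
definition of do (1):

$$\begin{aligned}
\mu(e_t \mid æ_{<t}, do(a_t))
&= \frac{\mu(æ_{<t}, e_t \mid do(a_t))}{\mu(æ_{<t} \mid do(a_t))} \\
&= \frac{\sum_{s \in \mathcal{S}} \mu(s, æ_{<t})\, \mu(e_t \mid s, æ_{<t} a_t)}{\mu(æ_{<t})} \\
&= \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<t})\, \mu(e_t \mid s, æ_{<t} a_t)
\end{aligned}$$

□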


Proposition 10 shows that between SCDT and SAEDT, the difference in
opinion about $e_t$ only depends on differences in their (acausal) posterior belief
$\mu(s \mid \dots)$ about the hidden state. SCDT and SAEDT thus become equivalent in
scenarios where there is only one hidden state $s^*$ with $\mu(s^*) = 1$, as this renders
$\mu(s^* \mid æ_{<t}) = \mu(s^* \mid æ_{<t} a_t) = \mu(s^*) = 1$. SPEDT, on the other hand, may
disagree with the other two also after a hidden state has been fixed.

From a problem modeler’s perspective, it is also instructive to consider
the effect of moving uncertainty between the hidden state and environmental
stochasticity. For two different environment models $\mu$ and $\mu'$, the action
and percept probabilities may be identical (i.e., $\mu(a_t \mid æ_{<t}) = \mu'(a_t \mid æ_{<t})$ and
$\mu(e_t \mid æ_{<t} a_t) = \mu'(e_t \mid æ_{<t} a_t)$) even though $\mu$ and $\mu'$ have non-isomorphic sets
of hidden states $\mathcal{S}$ and $\mathcal{S}'$. For example, given any $\mu$, an environment model
$\mu'$ with a single hidden state $s_0$, $\mu'(s_0) = 1$, may be constructed from $\mu$ by
$\mu'(s_0, æ_{<t}) := \mu(æ_{<t})$. The transformation will not affect SAEDT and
SPEDT, as the definitions of their value functions depend only on the
‘observable’ action- and percept-probabilities $\mu(a_t \mid æ_{<t})$ and $\mu(e_t \mid æ_{<t} a_t)$, which
are preserved between $\mu$ and $\mu'$. But the transformation will change SCDT’s
behavior in any $\mu$ where SCDT disagrees with SAEDT, as SCDT and SAEDT
are equivalent in $\mu'$ that only has a single hidden state. That SCDT depends
on what uncertainty is captured by the hidden state is unsurprising given that
the hidden state has a special place in the causal structure of the problem.
Ultimately, the modeler must decide what uncertainty to put in the hidden state,
and what to attribute to environmental stochasticity. A general principle for
how to do this is still an open question [SF14b].
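
The following Python sketch illustrates both observations on a one-step Newcomb model: the next-percept beliefs (6) and (8) differ only in the posterior over the hidden state, and collapsing the hidden state into a single state $s_0$ leaves SAEDT unchanged but changes SCDT. The function names, the `causal` flag, and all numbers (a 99% accurate predictor with a uniform prior) are assumptions for this illustration and are not the paper's implementation.

```python
def posterior(p_s, action_prob, percept_prob, hist, action=None):
    """mu(s | hist), or mu(s | hist, action) if `action` is given."""
    w = {}
    for s, p in p_s.items():
        for i, (a, e) in enumerate(hist):
            past = hist[:i]
            p *= action_prob(s, past)[a] * percept_prob(s, past, a).get(e, 0.0)
        if action is not None:
            p *= action_prob(s, hist)[action]  # the action itself is evidence
        w[s] = p
    total = sum(w.values())
    return {s: p / total for s, p in w.items()}

def value(p_s, action_prob, percept_prob, u, actions, m, hist=(), causal=False):
    """Backward recursion; returns (optimal value, optimal next action).

    causal=False uses the belief (6), causal=True uses the belief (8)."""
    if len(hist) >= m:
        return 0.0, None
    best = None
    for a in actions:
        post = posterior(p_s, action_prob, percept_prob, hist, None if causal else a)
        v = 0.0
        for s, ps in post.items():
            for e, pe in percept_prob(s, hist, a).items():
                future, _ = value(p_s, action_prob, percept_prob, u, actions, m,
                                  hist + ((a, e),), causal)
                v += ps * pe * (u(e) + future)
        if best is None or v > best[0]:
            best = (v, a)
    return best

def u(e):
    return e / 1_001_000  # dollar payoffs rescaled to [0, 1]

ACTIONS = ["one", "two"]

# Assumed model with the prediction as hidden state.
p_s = {"pred_one": 0.5, "pred_two": 0.5}

def action_prob(s, hist):
    likely = "one" if s == "pred_one" else "two"
    return {a: (0.99 if a == likely else 0.01) for a in ACTIONS}

def percept_prob(s, hist, a):
    payoff = (1_000_000 if s == "pred_one" else 0) + (1_000 if a == "two" else 0)
    return {payoff: 1.0}

# Collapsed model: one hidden state, same observable probabilities.
p_s0 = {"s0": 1.0}

def action_prob0(s, hist):
    return {"one": 0.5, "two": 0.5}

def percept_prob0(s, hist, a):
    if a == "one":
        return {1_000_000: 0.99, 0: 0.01}
    return {1_001_000: 0.01, 1_000: 0.99}

saedt = value(p_s, action_prob, percept_prob, u, ACTIONS, m=1)
scdt = value(p_s, action_prob, percept_prob, u, ACTIONS, m=1, causal=True)
saedt0 = value(p_s0, action_prob0, percept_prob0, u, ACTIONS, m=1)
scdt0 = value(p_s0, action_prob0, percept_prob0, u, ACTIONS, m=1, causal=True)
```

In the original model the evidential recursion one-boxes and the causal recursion two-boxes; in the collapsed model both one-box, with the same value as SAEDT in the original model. This matches the remark that the modeler's choice of what to put in the hidden state matters for SCDT only.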

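The value functions of SAEDT, SPEDT and SCDT can be rewritten in the
following iterative forms, where the latter form uses Proposition 10. Numbers
above equality signs reference a justifying equation. Let $a_i := \pi(æ_{<i})$ for $i \geq t$:

$$\begin{aligned}
V^{aev,\pi}_{\mu,m}(æ_{<t})
&= \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \mu(e_i \mid æ_{<i} a_i) && (9) \\
&\overset{(6)}{=} \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<i} a_i)\, \mu(e_i \mid s, æ_{<i} a_i) && (10) \\
V^{pev,\pi}_{\mu,m}(æ_{<t})
&= \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \mu(e_i \mid æ_{<i}, \pi_{i:m}) && (11) \\
&\overset{(7)}{=} \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<i}, \pi_{i:m})\, \mu(e_i \mid s, æ_{<i}, \pi_{i:m}) && (12) \\
V^{cau,\pi}_{\mu,m}(æ_{<t})
&= \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \mu(e_i \mid æ_{<i}, do(a_i)) && (13) \\
&\overset{(8)}{=} \sum_{k=t}^{m} \sum_{e_{t:k}} u(e_k) \prod_{i=t}^{k} \sum_{s \in \mathcal{S}} \mu(s \mid æ_{<i})\, \mu(e_i \mid s, æ_{<i} a_i) && (14)
\end{aligned}$$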
                        SAEDT               SPEDT               SCDT
Newcomb                 1-box               1-box               2-box
Newcomb w/ precommit    not commit, 1-box   not commit, 1-box   commit, 1-box
Newcomb w/ looking      not look, 1-box     not look, 1-box     indifferent, 2-box
Toxoplasmosis           not pet             not pet             pet
Seq. Toxoplasmosis      doc, not pet        no doc, not pet     doc, pet

Table 1: Decisions made by SAEDT, SPEDT, and SCDT in Example 1,
Example 2, Example 5, Example 8, and Example 9. The latter three
examples are sequential. Winning moves are in italics; in Newcomb with looking the
winning move is to be indifferent and one-box. Because Savage decision theory
is dualistic, these problems cannot be properly formalized in it.


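Proof of Proposition 7. By the definition of $\mathrm{do}(\pi_{t:m})$,

\[
\begin{aligned}
\mu(e_t \mid æ_{<t}, \mathrm{do}(\pi_{t:m}))
  &= \sum_{e_{t+1:m}} \mu(e_{t:m} \mid æ_{<t}, \mathrm{do}(a_t := \pi(æ_{<t}), \ldots, a_m := \pi(æ_{<m}))) \\
  &= \sum_{e_{t+1:m}} \sum_{s} \mu(s \mid æ_{<t})\, \mu(e_{t:m} \mid s, æ_{<t}, \mathrm{do}(\pi(æ_{<t}), \ldots, \pi(æ_{<m}))) \\
  &= \sum_{e_{t+1:m}} \sum_{s} \mu(s \mid æ_{<t}) \prod_{i=t}^{m} \mu(e_i \mid s, æ_{<i} \pi(æ_{<i})) \\
  &= \sum_{s} \mu(s \mid æ_{<t})\, \mu(e_t \mid s, æ_{<t} \pi(æ_{<t})) \\
  &= \mu(e_t \mid æ_{<t}, \mathrm{do}(\pi(æ_{<t})))
\end{aligned}
\]

The second equality follows from the equivalence
$P(\cdot) = \sum_s P(s) P(\cdot \mid s)$ applied to the distribution
$\mu(\,\cdot \mid æ_{<t}, \mathrm{do}(a_t := \pi(æ_{<t}), \ldots, a_m := \pi(æ_{<m})))$,
and the third equality by (repeated) application of the identity
$\mu(e_{t:m} \mid s, æ_{<t}, \mathrm{do}(a_{t:m})) = \prod_{i=t}^{m} \mu(e_i \mid s, æ_{<i} a_i)$.
The fourth equality holds because the factors for $i > t$ sum to one over
$e_{t+1:m}$. □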


4 Discussion 


Our paper is a first stab at the problem of how physicalistic agents should make
sequential decisions. CDT and EDT provide an existing basis for non-dualistic
decision making, which we extended to the sequential setting. There are two
natural ways for making sequential evidential decisions: do I update my beliefs
about the hidden state based on my next action (‘what I do next’, SAEDT)
or my whole policy (‘the kind of agent I am’, SPEDT)? By Proposition 7 this
distinction does not exist for causal decision theory, because with that theory
the agent does not consider its own actions evidence at all. Therefore we have
only one version of sequential causal decision theory, SCDT.

To illustrate the differences between the decision theories, we discussed three
variants of Newcomb’s problem (Example 1, Example 8 and Example 9) and
two variants of the toxoplasmosis problem (Example 2 and Example 5). The
formal specification of these examples can be found in Appendix A. We
implemented SCDT, SAEDT and SPEDT. Table 1 shows their behavior on those
examples.

So which decision theory is better? The answer to this question depends 
on which decision you consider to be correct (or even rational) in each of the 
problems. We posit that ultimately, what counts is not whether your decision 
algorithm is theoretically pleasing, but whether you win. Winning means getting 
the most utility. If maximizing utility involves making crazy decisions, then this 
is what you should do! 

In Newcomb’s problem, winning means one-boxing, because you end up
richer. In the toxoplasmosis problem, winning means petting the kitten,
because that yields more utility. (S)CDT performs suboptimally in the Newcomb
variations, while the evidential decision theories perform suboptimally in the
toxoplasmosis variations. This entails that neither CDT nor EDT is the final
answer to the problem of non-dualistic decision making.

Furthermore, neither CDT nor EDT agents are fully physicalistic: they do
not model the environment to contain themselves [SF14b]. For example, when
playing a prisoner’s dilemma against your own source code [SF15], your
opponent defects if and only if you defect. This logical connection between your
action and your opponent’s is disregarded in the formalization based on causal
graphical models that we discuss here because it is not causal.

Timeless decision theory [Yud10] and updateless decision theory [SF14b] are
recent attempts at more physicalistic decision theories. However, so far both
have eluded explicit formalization [SF15]. We conclude that finding a
physicalistic decision theory remains an important open problem in artificial
intelligence research.
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DP120100950. It started at a MIRIxCanberra workshop sponsored by the Ma¬ 
chine Intelligence Research Institute. Mayank Daswani and Daniel Filan con¬ 
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List of Notation 

$:=$  defined to be equal

$\mathbb{N}$  the natural numbers, starting with 0

$\mathbb{R}$  the real numbers

$\varepsilon$  a small positive real number

$\mathcal{A}$  the (finite) set of possible actions

$\mathcal{E}$  the (finite) set of possible percepts

$\mathcal{S}$  the set of hidden states

$u$  the utility function $u : \mathcal{E} \to [0, 1]$

$a_t$  the action in time step $t$

$e_t$  the percept in time step $t$

$æ_{<t}$  the first $t - 1$ interactions, $a_1 e_1 a_2 e_2 \ldots a_{t-1} e_{t-1}$

$æ_{i:k}$  the interactions between and including time step $i$ and time step $k$,
  $a_i e_i \ldots a_k e_k$

$æ_{1:\infty}$  a history of infinite length

$s$  a hidden state

$\pi$  a deterministic policy, i.e., a function $\pi : (\mathcal{A} \times \mathcal{E})^* \to \mathcal{A}$

$\pi_{t:k}$  policy $\pi$ restricted to the time steps between and including $t$ and $k$

$V^{\mathrm{aev},\pi}_{\mu,m}$  action-evidential value of policy $\pi$ in environment $\mu$ up to time step
  $m$, defined in (SAEDT)

$V^{\mathrm{pev},\pi}_{\mu,m}$  policy-evidential value of policy $\pi$ in environment $\mu$ up to time step
  $m$, defined in (SPEDT)

$V^{\mathrm{cau},\pi}_{\mu,m}$  causal value of policy $\pi$ in environment $\mu$ up to time step $m$, defined
  in (SCDT)

$k, i$  time steps, natural numbers

$t$  (current) time step

$m$  lifetime of the agent

$P_a$  distribution over percepts induced by action $a$ in SDT

$P$  distribution over percepts and actions in one-shot decision making

$\mu$  an accurate environment model











A Examples 

This section contains the formal calculations for Example 1, Example 2,
Example 5, Example 8 and Example 9. These calculations are also available as
Python code at http://jan.leike.name/.


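Example 11 (Newcomb’s Problem). This is a formalization of Example 1.

• $\mathcal{S} := \{E, F\}$ where $E$ means the opaque box is empty and $F$ means the
opaque box is full

• $\mathcal{A} := \{B_1, B_2\}$ where $B_1$ means one-boxing and $B_2$ means two-boxing

• $\mathcal{E} := \{O_0, O_T, O_M, O_{MT}\}$

• $u(O_0) := 0$, $u(O_T) := 1,000$, $u(O_M) := 1,000,000$, $u(O_{MT}) := 1,001,000$

Let $\varepsilon > 0$ be a small constant denoting the probability that the
predictor errs, so that its accuracy is $1 - \varepsilon$. Because the
environment has to assign non-zero probability to all actions, $\varepsilon$
must be strictly positive. The environment’s distribution $\mu$ is defined as
follows.

\[
\begin{aligned}
\mu(E) = \mu(F) &= 0.5 & \mu(O_0 \mid E, B_1) &= 1 \\
\mu(B_1 \mid F) = \mu(B_2 \mid E) &= 1 - \varepsilon & \mu(O_T \mid E, B_2) &= 1 \\
\mu(B_1 \mid E) = \mu(B_2 \mid F) &= \varepsilon & \mu(O_M \mid F, B_1) &= 1 \\
 & & \mu(O_{MT} \mid F, B_2) &= 1
\end{aligned}
\]

By Bayes’ rule,

\[
\mu(F \mid B_1) = \frac{\mu(B_1 \mid F)\,\mu(F)}{\sum_{s \in \mathcal{S}} \mu(B_1 \mid s)\,\mu(s)} = 1 - \varepsilon,
\]

which also gives $\mu(E \mid B_1) = \varepsilon$. Similarly,
$\mu(F \mid B_2) = \varepsilon$ and $\mu(E \mid B_2) = 1 - \varepsilon$.

For EDT we use equation (EDT) to compute the value of an action. Since
the percept $e$ is generated deterministically, $\mu(e \mid s, a)$ only attains
values 0 or 1. We therefore omit it in the calculation below. For action $B_1$
we get

\[
\begin{aligned}
V^{\mathrm{evi},B_1}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid B_1)\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, B_1)\, \mu(s \mid B_1)\, u(e) \\
  &= \mu(E \mid B_1)\, u(O_0) + \mu(F \mid B_1)\, u(O_M) \\
  &= \varepsilon \cdot 0 + (1 - \varepsilon) \cdot 1,000,000
\end{aligned}
\]

For action $B_2$ we get

\[
\begin{aligned}
V^{\mathrm{evi},B_2}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid B_2)\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, B_2)\, \mu(s \mid B_2)\, u(e) \\
  &= \mu(E \mid B_2)\, u(O_T) + \mu(F \mid B_2)\, u(O_{MT}) \\
  &= (1 - \varepsilon) \cdot 1,000 + \varepsilon \cdot 1,001,000 \\
  &= 1,000 + \varepsilon \cdot 1,000,000
\end{aligned}
\]

For $\varepsilon < 0.4995$, i.e., an error rate below 49.95% (just slightly
better than random guessing), we get that EDT favors $B_1$ over $B_2$:

\[
V^{\mathrm{evi},B_1}_{\mu} = (1 - \varepsilon) \cdot 1,000,000 > 500,500 > 1,000 + \varepsilon \cdot 1,000,000 = V^{\mathrm{evi},B_2}_{\mu}
\]

For CDT we use equation (CDT) to compute the value of an action. For
action $B_1$ we get

\[
\begin{aligned}
V^{\mathrm{cau},B_1}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid \mathrm{do}(B_1))\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, B_1)\, \mu(s)\, u(e) \\
  &= \mu(E)\, u(O_0) + \mu(F)\, u(O_M) \\
  &= 0.5 \cdot 0 + 0.5 \cdot 1,000,000 = 500,000
\end{aligned}
\]

For action $B_2$ we get

\[
\begin{aligned}
V^{\mathrm{cau},B_2}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid \mathrm{do}(B_2))\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, B_2)\, \mu(s)\, u(e) \\
  &= \mu(E)\, u(O_T) + \mu(F)\, u(O_{MT}) \\
  &= 0.5 \cdot 1,000 + 0.5 \cdot 1,001,000 = 501,000
\end{aligned}
\]

We get that CDT favors $B_2$ over $B_1$ regardless of the prediction error
$\varepsilon$:

\[
V^{\mathrm{cau},B_1}_{\mu} = 500,000 < 501,000 = V^{\mathrm{cau},B_2}_{\mu}
\]

Moreover, CDT prefers $B_2$ regardless of the prior $\mu(E)$ over the hidden
state. Two-boxing is the dominant action because it yields \$1,000 more
regardless of the hidden state.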
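
The calculation above can be checked numerically. The following is a minimal Python sketch, not the authors' published code; the function name `newcomb_values` and the dictionary layout are illustrative assumptions. It uses exact rational arithmetic so that no rounding enters the comparison.

```python
from fractions import Fraction


def newcomb_values(eps):
    """Return one-shot EDT and CDT values for one-boxing (B1) and two-boxing (B2)."""
    prior = {"E": Fraction(1, 2), "F": Fraction(1, 2)}  # mu(s)
    # mu(a | s): the predictor errs with probability eps.
    act = {("B1", "F"): 1 - eps, ("B2", "E"): 1 - eps,
           ("B1", "E"): eps, ("B2", "F"): eps}
    # The percept is a deterministic function of (s, a); store its utility.
    payoff = {("E", "B1"): 0, ("E", "B2"): 1_000,
              ("F", "B1"): 1_000_000, ("F", "B2"): 1_001_000}
    edt, cdt = {}, {}
    for a in ("B1", "B2"):
        mu_a = sum(act[a, s] * prior[s] for s in prior)
        # Bayes' rule: mu(s | a) = mu(a | s) mu(s) / mu(a)
        posterior = {s: act[a, s] * prior[s] / mu_a for s in prior}
        edt[a] = sum(posterior[s] * payoff[s, a] for s in prior)  # conditions on a
        cdt[a] = sum(prior[s] * payoff[s, a] for s in prior)      # keeps the prior
    return edt, cdt


edt, cdt = newcomb_values(Fraction(1, 100))
print("EDT:", edt["B1"], edt["B2"])
print("CDT:", cdt["B1"], cdt["B2"])
```

With $\varepsilon = 0.01$ the sketch yields evidential values 990,000 (one-boxing) and 11,000 (two-boxing), and causal values 500,000 and 501,000. At $\varepsilon = 0.4995$ both evidential values equal 500,500, which is the break-even point used above.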


Example 12 (Newcomb with Looking). This is a formalization of Example 9;
it extends Example 11.

In the first time step, the agent gets to choose between looking into the box
($L$) and not looking ($N$). If the agent looks, the subsequent percept will be
$E$ or $F$, depending on whether the box is empty ($E$) or full ($F$). If the
agent does not look, the subsequent percept will be $\emptyset$. All three of
these percepts $E$, $F$, and $\emptyset$ have zero utility.

In the second time step the agent chooses to one-box ($B_1$) or to two-box
($B_2$). The payoffs are then based on the boxes’ contents as in Example 11.

• $\mathcal{S} := \{E, F\}$ where $E$ means the opaque box is empty and $F$ means the
opaque box is full

• $\mathcal{A} := \{B_1, B_2\}$ where $B_1$ means one-boxing and $B_2$ means two-boxing,
$L := B_1$ means looking into the box and $N := B_2$ means not looking (the
set of actions has to be the same for all time steps)

• $\mathcal{E} := \{E, F, \emptyset, O_0, O_T, O_M, O_{MT}\}$

• $u(O_0) := 0$, $u(O_T) := 1,000$, $u(O_M) := 1,000,000$, $u(O_{MT}) := 1,001,000$,
$u(E) := u(F) := u(\emptyset) := 0$

Let $\varepsilon > 0$ be a small constant denoting the probability that the
prediction is wrong. Because the environment has to assign non-zero
probability to all actions, $\varepsilon$ must be strictly positive. The
environment’s distribution $\mu$ is defined as follows. Question marks
stand for single actions or percepts whose value is irrelevant.

\[
\begin{aligned}
\mu(E) = \mu(F) &= 0.5 & \mu(E \mid E, L) &= 1 \\
\mu(L \mid F) = \mu(L \mid E) &= 0.5 & \mu(\emptyset \mid E, N) &= 1 \\
\mu(N \mid F) = \mu(N \mid E) &= 0.5 & \mu(F \mid F, L) &= 1 \\
\mu(B_1 \mid E, ??) &= \varepsilon & \mu(\emptyset \mid F, N) &= 1 \\
\mu(B_1 \mid F, ??) &= 1 - \varepsilon & \mu(O_0 \mid E, ??B_1) &= 1 \\
\mu(B_2 \mid E, ??) &= 1 - \varepsilon & \mu(O_T \mid E, ??B_2) &= 1 \\
\mu(B_2 \mid F, ??) &= \varepsilon & \mu(O_M \mid F, ??B_1) &= 1 \\
 & & \mu(O_{MT} \mid F, ??B_2) &= 1
\end{aligned}
\]

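The environment’s game tree is given as follows, where dashed lines connect
states indistinguishable by the agent (also known as information sets):

[Figure: game tree of Newcomb with looking. The environment first chooses the
hidden state $E$ or $F$; the agent then chooses $L$ or $N$ and finally $B_1$ or
$B_2$, with the final branches labeled by the probabilities $\varepsilon$ and
$1 - \varepsilon$. Leaf payoffs: 0 ($E$, $B_1$), 1,000 ($E$, $B_2$),
1,000,000 ($F$, $B_1$) and 1,001,000 ($F$, $B_2$).]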

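Using Bayes’ rule, we calculate the following conditional probabilities of the
hidden state given a history $a_1$ or $a_1 e_1 a_2$:

\[
\begin{aligned}
0.5 &= \mu(E \mid L) = \mu(F \mid L) = \mu(E \mid N) = \mu(F \mid N) \\
1 &= \mu(E \mid LEB_1) = \mu(E \mid LEB_2) = \mu(F \mid LFB_1) = \mu(F \mid LFB_2) \\
\varepsilon &= \mu(E \mid N\emptyset B_1) = \mu(F \mid N\emptyset B_2) \\
1 - \varepsilon &= \mu(E \mid N\emptyset B_2) = \mu(F \mid N\emptyset B_1)
\end{aligned}
\]

Next, we write out the formula for SAEDT for a horizon of 2 based on (10).
The first percept has no utility, which simplifies the equation.

\[
V^{\mathrm{aev},\pi}_{\mu,2} = \sum_{e_{1:2}} u(e_2)
  \Bigl( \sum_{s \in \mathcal{S}} \mu(s \mid a_1)\, \mu(e_1 \mid s, a_1) \Bigr)
  \Bigl( \sum_{s \in \mathcal{S}} \mu(s \mid æ_1 a_2)\, \mu(e_2 \mid s, æ_1 a_2) \Bigr)
\]

where $a_1 = \pi(\epsilon)$ (the action after the empty history) and
$a_2 = \pi(æ_1)$. The formula for SPEDT for a horizon of 2 based on (12) is as
follows.

\[
V^{\mathrm{pev},\pi}_{\mu,2} = \sum_{e_{1:2}} u(e_2)\,
  \frac{\sum_{s \in \mathcal{S}} \mu(s\, a_1 e_1 \pi(a_1 e_1))}
       {\sum_{s \in \mathcal{S}} \sum_{e \in \mathcal{E}} \mu(s\, a_1 e\, \pi(a_1 e))}
  \Bigl( \sum_{s \in \mathcal{S}} \mu(s \mid æ_1 \pi_2)\, \mu(e_2 \mid s, æ_1 a_2) \Bigr)
\]

with $\pi_{1:2}$ and $\pi_2$ defined according to (1); the fraction is the
probability $\mu(e_1 \mid \pi_{1:2})$ written out. The formula for SCDT for a
horizon of 2 based on (14) is as follows.

\[
V^{\mathrm{cau},\pi}_{\mu,2} = \sum_{e_{1:2}} u(e_2)
  \Bigl( \sum_{s \in \mathcal{S}} \mu(s)\, \mu(e_1 \mid s, a_1) \Bigr)
  \Bigl( \sum_{s \in \mathcal{S}} \mu(s \mid æ_1)\, \mu(e_2 \mid s, æ_1 a_2) \Bigr)
\]

where $a_1 = \pi(\epsilon)$ and $a_2 = \pi(æ_1)$.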

There are six different possible policies: 

• Look and always one-box (curious one-boxer) 

• Look and always two-box (curious two-boxer) 

• Don’t look and one-box (incurious one-boxer) 

• Don’t look and two-box (incurious two-boxer) 

• Look and one-box iff the box is empty (paradox-lover) 

• Look and one-box iff the box is full (fatalistic)

Using the formulas above we can calculate their value. We use $\varepsilon := 0.01$.



Policy                 $V^{\mathrm{aev},\pi}_{\mu,2}$   $V^{\mathrm{pev},\pi}_{\mu,2}$   $V^{\mathrm{cau},\pi}_{\mu,2}$
Curious one-boxer      500,000      *990,000*    500,000
Curious two-boxer      501,000      11,000       *501,000*
Incurious one-boxer    *990,000*    *990,000*    500,000
Incurious two-boxer    11,000       11,000       *501,000*
Paradox-lover          500,500      500,500      500,500
Fatalistic             500,500      500,500      500,500


The highest values are displayed in italics. The incurious one-boxer has the 
highest action-evidential value. The curious one-boxer and the incurious one- 
boxer have the highest policy-evidential value. However, of these two policies 
only the incurious one-boxer is a time-consistent policy for SPEDT, because the 
agent wants to two-box after looking into the box: 

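\[
\begin{aligned}
V^{\mathrm{aev},B_1}_{\mu,2}(LF) &= V^{\mathrm{pev},B_1}_{\mu,2}(LF) = 1,000,000 \\
V^{\mathrm{aev},B_2}_{\mu,2}(LF) &= V^{\mathrm{pev},B_2}_{\mu,2}(LF) = 1,001,000 \\
V^{\mathrm{aev},B_1}_{\mu,2}(LE) &= V^{\mathrm{pev},B_1}_{\mu,2}(LE) = 0 \\
V^{\mathrm{aev},B_2}_{\mu,2}(LE) &= V^{\mathrm{pev},B_2}_{\mu,2}(LE) = 1,000
\end{aligned}
\]

Here the superscripts $B_1$ and $B_2$ denote one-boxing and two-boxing in the
second time step.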

The curious two-boxer and the incurious two-boxer have the highest causal 
value, and they are both time-consistent for SCDT. 
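
The three horizon-2 formulas can be evaluated mechanically for all six policies. The sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names (`joint`, `posterior`, `values`) are hypothetical, the empty percept is encoded as the string `"0"`, and $\varepsilon = 0.01$ as in the text. The three theories differ only in the weight given to the first percept and in the belief about the hidden state used in the second step.

```python
from fractions import Fraction as Fr

EPS = Fr(1, 100)  # prediction error probability, as in the text
STATES = ("E", "F")
PRIOR = {"E": Fr(1, 2), "F": Fr(1, 2)}
PAYOFF = {("E", "B1"): 0, ("E", "B2"): 1_000,
          ("F", "B1"): 1_000_000, ("F", "B2"): 1_001_000}


def p_e1(e1, s, a1):
    """mu(e1 | s, a1): looking reveals the state, not looking yields '0'."""
    return Fr(int(e1 == (s if a1 == "L" else "0")))


def p_a2(a2, s):
    """mu(a2 | s, history): the action matches the prediction w.p. 1 - EPS."""
    predicted = "B1" if s == "F" else "B2"
    return 1 - EPS if a2 == predicted else EPS


def joint(s, a1, e1=None, a2=None):
    """mu(s, a1 [e1 [a2]]); mu(a1 | s) = 1/2 for both first actions."""
    p = PRIOR[s] * Fr(1, 2)
    if e1 is not None:
        p *= p_e1(e1, s, a1)
    if a2 is not None:
        p *= p_a2(a2, s)
    return p


def posterior(*history):
    """mu(s | history) by Bayes' rule."""
    weights = {s: joint(s, *history) for s in STATES}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def expected_payoff(belief, a2):
    return sum(belief[s] * PAYOFF[s, a2] for s in STATES)


def values(a1, second):
    """Return (aev, pev, cau); `second` maps the first percept to action 2."""
    percepts = ("E", "F") if a1 == "L" else ("0",)
    # Probability that the whole policy is followed.
    policy_prob = sum(joint(s, a1, e, second[e]) for s in STATES for e in percepts)
    aev = pev = cau = Fr(0)
    for e1 in percepts:
        a2 = second[e1]
        p_aev = sum(posterior(a1)[s] * p_e1(e1, s, a1) for s in STATES)
        p_pev = sum(joint(s, a1, e1, a2) for s in STATES) / policy_prob
        p_cau = sum(PRIOR[s] * p_e1(e1, s, a1) for s in STATES)
        aev += p_aev * expected_payoff(posterior(a1, e1, a2), a2)
        pev += p_pev * expected_payoff(posterior(a1, e1, a2), a2)
        cau += p_cau * expected_payoff(posterior(a1, e1), a2)
    return aev, pev, cau


POLICIES = {
    "curious one-boxer": ("L", {"E": "B1", "F": "B1"}),
    "curious two-boxer": ("L", {"E": "B2", "F": "B2"}),
    "incurious one-boxer": ("N", {"0": "B1"}),
    "incurious two-boxer": ("N", {"0": "B2"}),
    "paradox-lover": ("L", {"E": "B1", "F": "B2"}),
    "fatalistic": ("L", {"E": "B2", "F": "B1"}),
}
for name, (a1, second) in POLICIES.items():
    print(name, *values(a1, second))
```

Running the sketch reproduces the values in the table above, for example 500,000 / 990,000 / 500,000 for the curious one-boxer.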


Example 13 (Newcomb with Precommitment). This is a formalization of
Example 8; it extends Example 11.

In the first time step, the agent gets to choose between signing the contract
($S$) and not signing ($N$). If the agent signs, the subsequent percept will be
$C$, which costs \$300,000, and the prediction will be updated to one-boxing.
If the agent does not sign, the subsequent percept will be $\emptyset$ with
zero utility.

In the second time step the agent chooses to one-box ($B_1$) or to two-box
($B_2$). The payoffs are then based on the boxes’ contents as in Example 11. If
the agent signed the contract and chooses two boxes, this incurs an additional
cost of \$2,000.

• $\mathcal{S} := \{E, F\}$ where $E$ means the opaque box is empty and $F$ means the
opaque box is full

• $\mathcal{A} := \{B_1, B_2\}$ where $B_1$ means one-boxing and $B_2$ means two-boxing,
$S := B_1$ means signing the contract and $N := B_2$ means not signing (the
set of actions has to be the same for all time steps)

• $\mathcal{E} := \{C, \emptyset, O_0, O_T, O_{-T}, O_M, O_{MT}, O_{M-T}\}$

• $u(O_0) := 0$, $u(O_T) := 1,000$, $u(O_{-T}) := -1,000$, $u(O_M) := 1,000,000$,
$u(O_{MT}) := 1,001,000$, $u(O_{M-T}) := 999,000$, $u(C) := -300,000$,
$u(\emptyset) := 0$

Let $\varepsilon > 0$ be a small constant denoting the probability that the
prediction is wrong. Because the environment has to assign non-zero
probability to all actions, $\varepsilon$ must be strictly positive. The
environment’s distribution $\mu$ is defined as follows. Question marks stand
for a single hidden state, action or percept whose value is irrelevant.

\[
\begin{aligned}
\mu(E) = \mu(F) &= 0.5 & \mu(C \mid E, S) &= 1 \\
\mu(S \mid F) = \mu(S \mid E) &= 0.5 & \mu(\emptyset \mid E, N) &= 1 \\
\mu(N \mid F) = \mu(N \mid E) &= 0.5 & \mu(C \mid F, S) &= 1 \\
\mu(B_1 \mid E, N\emptyset) &= \varepsilon & \mu(\emptyset \mid F, N) &= 1 \\
\mu(B_1 \mid F, N\emptyset) &= 1 - \varepsilon & \mu(O_0 \mid E, N\emptyset B_1) &= 1 \\
\mu(B_2 \mid E, N\emptyset) &= 1 - \varepsilon & \mu(O_T \mid E, N\emptyset B_2) &= 1 \\
\mu(B_2 \mid F, N\emptyset) &= \varepsilon & \mu(O_M \mid F, N\emptyset B_1) &= 1 \\
\mu(B_2 \mid ?, SC) &= \varepsilon & \mu(O_{MT} \mid F, N\emptyset B_2) &= 1 \\
\mu(B_1 \mid ?, SC) &= 1 - \varepsilon & \mu(O_M \mid ?, SCB_1) &= 1 \\
 & & \mu(O_{M-T} \mid ?, SCB_2) &= 1
\end{aligned}
\]


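The environment’s game tree is given as follows:

[Figure: game tree of Newcomb with precommitment. The environment chooses the
hidden state with probability 0.5 each; the agent then chooses $S$ or $N$ and
finally $B_1$ or $B_2$. Total payoffs at the leaves: after not signing,
1,000,000 ($F$, $B_1$), 1,001,000 ($F$, $B_2$), 0 ($E$, $B_1$) and
1,000 ($E$, $B_2$); after signing, 700,000 ($B_1$) and 699,000 ($B_2$) in
both hidden states.]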


There are four different possible policies: 

• Sign the contract and one-box (signing one-boxer) 

• Sign the contract and two-box (signing two-boxer) 

• Don’t sign the contract and one-box (refusing one-boxer) 

• Don’t sign the contract and two-box (refusing two-boxer) 

Using the formulas from Example 12 we can calculate their value. We use
$\varepsilon := 0.01$.

Policy                 $V^{\mathrm{aev},\pi}_{\mu,2}$   $V^{\mathrm{pev},\pi}_{\mu,2}$   $V^{\mathrm{cau},\pi}_{\mu,2}$
Signing one-boxer      700,000      700,000      *700,000*
Signing two-boxer      699,000      699,000      699,000
Refusing one-boxer     *990,000*    *990,000*    500,000
Refusing two-boxer     11,000       11,000       501,000

The highest values are displayed in italics. Both SAEDT and SPEDT refuse 
the contract: the refusing one-boxer has the highest action-evidential and the 
highest policy-evidential value. SCDT signs the contract and then one-boxes: 
the signing one-boxer has the highest causal value. 
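
These four values can be reproduced with a short Python sketch. It is an illustration rather than the authors' code, and it relies on two observations that hold for this particular example: the first percept is determined by the first action, so SAEDT and SPEDT assign the same value (hence a single "evidential" number), and the first action carries no information about the hidden state, so SCDT uses the prior. The function names `p_box`, `payoff` and `values` are hypothetical.

```python
from fractions import Fraction as Fr

EPS = Fr(1, 100)  # prediction error probability, as in the text
PRIOR = {"E": Fr(1, 2), "F": Fr(1, 2)}


def p_box(a2, s, a1):
    """mu(a2 | s, a1 e1): after signing, one-boxing is predicted in both states."""
    predicted = "B1" if (a1 == "S" or s == "F") else "B2"
    return 1 - EPS if a2 == predicted else EPS


def payoff(s, a1, a2):
    """Total utility u(e1) + u(e2) along a branch."""
    if a1 == "S":
        # Contract fee of 300,000; the box is full; two-boxing costs 2,000
        # relative to 1,001,000, i.e. the percept O_{M-T} worth 999,000.
        return -300_000 + (1_000_000 if a2 == "B1" else 999_000)
    base = 1_000_000 if s == "F" else 0
    return base + (1_000 if a2 == "B2" else 0)


def values(a1, a2):
    """Return (evidential, causal) value of the policy 'a1, then a2'."""
    # Evidential belief mu(s | a1 e1 a2); the factor mu(a1 | s) = 1/2 cancels.
    weights = {s: PRIOR[s] * p_box(a2, s, a1) for s in PRIOR}
    total = sum(weights.values())
    evidential = sum(weights[s] / total * payoff(s, a1, a2) for s in PRIOR)
    # Causal belief mu(s | a1 e1) equals the prior in this example.
    causal = sum(PRIOR[s] * payoff(s, a1, a2) for s in PRIOR)
    return evidential, causal


for a1 in ("S", "N"):
    for a2 in ("B1", "B2"):
        print(a1, a2, *values(a1, a2))
```

The sketch gives 990,000 as the evidential value of refusing and one-boxing, and 700,000 as the best causal value (signing and one-boxing), in line with the table.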


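Example 14 (Toxoplasmosis). This is a formalization of Example 2.

• $\mathcal{S} := \{T, H\}$ where $T$ means having the toxoplasmosis parasite and $H$
means being healthy

• $\mathcal{A} := \{P, N\}$ where $P$ means petting and $N$ means not petting

• $\mathcal{E} := \{P\&T, N\&T, P\&H, N\&H\}$ where the percepts just reflect the action
and hidden state

• $u(P\&T) := -9$, $u(N\&T) := -10$, $u(P\&H) := 1$, $u(N\&H) := 0$ where
petting gives a utility of 1 and suffering from the parasite gives a utility
of $-10$

The environment’s distribution $\mu$ is defined as follows.

\[
\begin{aligned}
\mu(T) = \mu(H) &= 0.5 & \mu(P\&T \mid T, P) &= 1 \\
\mu(P \mid T) &= 0.8 & \mu(N\&T \mid T, N) &= 1 \\
\mu(N \mid T) &= 0.2 & \mu(P\&H \mid H, P) &= 1 \\
\mu(P \mid H) &= 0.2 & \mu(N\&H \mid H, N) &= 1 \\
\mu(N \mid H) &= 0.8 & &
\end{aligned}
\]

Using Bayes’ rule, we calculate the following conditional probabilities.

\[
\mu(T \mid P) = 0.8 \qquad \mu(H \mid P) = 0.2 \qquad \mu(T \mid N) = 0.2 \qquad \mu(H \mid N) = 0.8
\]

We consider EDT first. Since the percept $e$ is generated deterministically,
$\mu(e \mid s, a)$ only attains values 0 or 1. We therefore omit it in the
calculation below. For action $P$ (petting) we get

\[
\begin{aligned}
V^{\mathrm{evi},P}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid P)\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, P)\, \mu(s \mid P)\, u(e) \\
  &= \mu(T \mid P)\, u(P\&T) + \mu(H \mid P)\, u(P\&H) \\
  &= 0.8 \cdot (-9) + 0.2 \cdot 1 = -7
\end{aligned}
\]

For action $N$ (not petting) we get

\[
\begin{aligned}
V^{\mathrm{evi},N}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid N)\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, N)\, \mu(s \mid N)\, u(e) \\
  &= \mu(T \mid N)\, u(N\&T) + \mu(H \mid N)\, u(N\&H) \\
  &= 0.2 \cdot (-10) + 0.8 \cdot 0 = -2
\end{aligned}
\]

Therefore we get that EDT favors $N$ over $P$:

\[
V^{\mathrm{evi},P}_{\mu} = -7 < -2 = V^{\mathrm{evi},N}_{\mu}
\]

For CDT we get for action $P$ (petting)

\[
\begin{aligned}
V^{\mathrm{cau},P}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid \mathrm{do}(P))\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, P)\, \mu(s)\, u(e) \\
  &= \mu(T)\, u(P\&T) + \mu(H)\, u(P\&H) \\
  &= 0.5 \cdot (-9) + 0.5 \cdot 1 = -4
\end{aligned}
\]

For action $N$ (not petting) we get

\[
\begin{aligned}
V^{\mathrm{cau},N}_{\mu} := \sum_{e \in \mathcal{E}} \mu(e \mid \mathrm{do}(N))\, u(e)
  &= \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, N)\, \mu(s)\, u(e) \\
  &= \mu(T)\, u(N\&T) + \mu(H)\, u(N\&H) \\
  &= 0.5 \cdot (-10) + 0.5 \cdot 0 = -5
\end{aligned}
\]

We get that CDT favors $P$ over $N$:

\[
V^{\mathrm{cau},P}_{\mu} = -4 > -5 = V^{\mathrm{cau},N}_{\mu}
\]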
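
The same one-shot comparison can be written as a few lines of Python. This is a minimal sketch with hypothetical names (`edt_value`, `cdt_value`), not the authors' code. The only difference between the two functions is whether the hidden state is weighted by the posterior $\mu(s \mid a)$ or by the prior $\mu(s)$.

```python
from fractions import Fraction as Fr

PRIOR = {"T": Fr(1, 2), "H": Fr(1, 2)}                  # mu(s)
ACT = {("P", "T"): Fr(4, 5), ("N", "T"): Fr(1, 5),      # mu(a | s)
       ("P", "H"): Fr(1, 5), ("N", "H"): Fr(4, 5)}
UTIL = {("P", "T"): -9, ("N", "T"): -10,                # u of the percept a&s
        ("P", "H"): 1, ("N", "H"): 0}


def edt_value(a):
    """Evidential value: weight states by the posterior mu(s | a)."""
    mu_a = sum(ACT[a, s] * PRIOR[s] for s in PRIOR)
    return sum(ACT[a, s] * PRIOR[s] / mu_a * UTIL[a, s] for s in PRIOR)


def cdt_value(a):
    """Causal value: weight states by the prior mu(s)."""
    return sum(PRIOR[s] * UTIL[a, s] for s in PRIOR)


for a in ("P", "N"):
    print(a, edt_value(a), cdt_value(a))
```

The sketch returns $-7$ and $-2$ for the evidential values of petting and not petting, and $-4$ and $-5$ for the causal values.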
Example 15 (Sequential Toxoplasmosis). We here formalize a version of Example 5.
First the agent chooses whether to go to the doctor. Going to the doctor incurs
a fee, but removes the risk of getting sick. Agents that do not go to the doctor
have a chance of meeting a kitten. If they meet it, they can choose to pet it or
not; infected agents are more likely to pet the kitten. The example is intended
to elucidate the difference between SAEDT and SPEDT, whose decisions we
will calculate in detail. We will not calculate the action of SCDT.

• $\mathcal{S} := \{T\text{(oxoplasmosis)}, H\text{(ealthy)}\}$.

• $\mathcal{A} := \{Y\text{(es)}, N\text{(o)}\}$. In this example, an action is taken twice. We use
$Y_1$ and $Y_2$, and $N_1$ and $N_2$, to distinguish between the first and the
second action.

• $\mathcal{E} := \{C$(ured), $K$(itten), $S$(ick, not pet kitten), $s$(ick, pet kitten),
$P$(et, not sick), $\emptyset$(neutral)$\}$

• $u(C) := -4$, $u(K) := 0$, $u(S) := -10$, $u(s) := -9$, $u(P) := 1$, and
$u(\emptyset) := 0$.

The environment’s game tree is given as follows, where dashed lines connect
states indistinguishable by the agent.

[Figure: game tree of sequential toxoplasmosis. After the hidden state is
chosen, $Y_1$ leads to $C$ ($-4$). After $N_1$, an infected agent receives $K$
(0) with probability 0.2 and $S$ ($-10$) with probability 0.8, while a healthy
agent receives $K$ (0). After $K$, an infected agent reaches $s$ ($-9$) via
$Y_2$ (probability 0.8) and $S$ ($-10$) via $N_2$ (probability 0.2); a healthy
agent reaches $P$ (1) via $Y_2$ (probability 0.2) and $\emptyset$ (0) via $N_2$
(probability 0.8).]

First, the environment chooses whether to infect the agent or not with the
parasite with probability 0.5. The agent then decides whether to see the doctor.
If the agent sees the doctor, this incurs a (utility) fee of $-4$, but the agent
will not be sick. If the agent does not see the doctor, there will be a kitten
with probability 0.2 (or 1) and the agent will pet it with probability 0.8 (or
0.2) if the parasite is present (or not). If there is no kitten, the next
percept is $S$ or $\emptyset$ depending on whether the agent is infected or not.
The agent gets $-10$ utility if infected and did not see the doctor, and gets
$+1$ utility for petting the kitten.

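We want to compare the choices of SAEDT and SPEDT. Their two-step
value functions are

\[
\begin{aligned}
V^{\mathrm{aev},\pi}_{\mu,2} &= \sum_{e_1} \mu(e_1 \mid a_1) \bigl( u(e_1) + V^{\mathrm{aev},\pi}_{\mu,2}(a_1 e_1) \bigr) \\
V^{\mathrm{pev},\pi}_{\mu,2} &= \sum_{e_1} \mu(e_1 \mid \pi_{1:2}) \bigl( u(e_1) + V^{\mathrm{pev},\pi}_{\mu,2}(a_1 e_1) \bigr)
\end{aligned}
\]

where the second step value functions

\[
V^{\mathrm{aev},\pi}_{\mu,2}(a_1 e_1) = V^{\mathrm{pev},\pi}_{\mu,2}(a_1 e_1) = \sum_{e_2} \mu(e_2 \mid a_1 e_1 a_2) \cdot u(e_2)
\]

are the same for both decision theories. They only differ by assigning probability
$\mu(e_1 \mid a_1)$ and $\mu(e_1 \mid \pi_{1:2})$ to the first percept, respectively.

Since not petting is always better than petting for evidential agents (the
evidence towards not having the disease weighs stronger than the extra utility),
the only policies that are potentially optimal and time consistent are
$\pi_1 := N_1 N_2$ and $\pi_2 := Y_1$.

First percept. For $\pi_1$ the occurring action-evidential quantities
$\mu(e_1 \mid a_1)$ are

\[
\begin{aligned}
\mu(N_1) &= \sum_{s \in \mathcal{S}} \mu(s, N_1) = \mu(T, N_1) + \mu(H, N_1) = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} \\
\mu(e_1 = S \mid N_1) &= \frac{\sum_{s \in \mathcal{S}} \mu(s, N_1 S)}{\mu(N_1)} = \frac{\mu(T, N_1 S)}{\mu(N_1)}
  = \frac{\tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{4}{5}}{\tfrac{1}{2}} = \tfrac{2}{5} \\
\mu(e_1 = K \mid N_1) &= 1 - \mu(S \mid N_1) = \tfrac{3}{5}
\end{aligned}
\]

and the occurring policy-evidential quantities $\mu(e_1 \mid \pi_{1:2})$ are

\[
\begin{aligned}
\mu(N_1 N_2) &= \sum_{s, e_1, e_2} \mu(s, N_1 e_1 N_2 e_2) \\
  &= \mu(T, N_1 K N_2 S) + \mu(T, N_1 S N_2 \emptyset) + \mu(H, N_1 K N_2 \emptyset)
   = \tfrac{1}{100} + \tfrac{1}{10} + \tfrac{1}{5} = \tfrac{31}{100} \\
\mu(e_1 = K \mid N_1 N_2) &= \frac{\sum_{s, e_2} \mu(s, N_1 K N_2 e_2)}{\mu(N_1, N_2)}
  = \frac{\mu(T, N_1 K N_2 S) + \mu(H, N_1 K N_2 \emptyset)}{\mu(N_1, N_2)}
  = \frac{\tfrac{1}{100} + \tfrac{1}{5}}{\tfrac{31}{100}} = \tfrac{21}{31} \\
\mu(e_1 = S \mid N_1 N_2) &= 1 - \mu(K \mid N_1 N_2) = \tfrac{10}{31}
\end{aligned}
\]

The policy $\pi_2 = \{Y_1\}$ always goes to the doctor for the treatment, and so

\[
\mu(e_1 = C \mid Y_1) = 1
\]

for both SAEDT and SPEDT.

Second percept. With the policy $\pi_2$, the second percept is always empty.
Under $\pi_1$, the only action sequence that can reach the second percept is
$N_1 K N_2$:

\[
\begin{aligned}
\mu(N_1 K N_2) &= \sum_{s} \mu(s, N_1 K N_2) = \mu(T, N_1 K N_2) + \mu(H, N_1 K N_2)
  = \tfrac{1}{100} + \tfrac{1}{5} = \tfrac{21}{100} \\
\mu(e_2 = S \mid N_1 K N_2) &= \frac{\sum_{s} \mu(s, N_1 K N_2 S)}{\mu(N_1 K N_2)}
  = \frac{\mu(T, N_1 K N_2 S)}{\mu(N_1 K N_2)} = \tfrac{1}{21}
\end{aligned}
\]

Value Functions. We start by evaluating the recursive definition from the
second time step. The second step value functions are 0 for $\pi_2$ and for the
history $N_1 S$ for $\pi_1$. For the history $N_1 K$, both SAEDT and SPEDT
assign the following identical value to $\pi_1$:

\[
\begin{aligned}
V^{\mathrm{aev},\pi_1}_{\mu,2}(N_1 K) = V^{\mathrm{pev},\pi_1}_{\mu,2}(N_1 K)
  &= \sum_{e_2} \mu(e_2 \mid N_1 K N_2) \cdot u(e_2) \\
  &= \mu(e_2 = S \mid N_1 K N_2) \cdot u(S) + \mu(e_2 = \emptyset \mid N_1 K N_2) \cdot u(\emptyset) \\
  &= \tfrac{1}{21} \cdot (-10) + \tfrac{20}{21} \cdot 0 = -\tfrac{10}{21}
\end{aligned}
\]

The first step value functions now evaluate to:

\[
\begin{aligned}
V^{\mathrm{aev},\pi_1}_{\mu,2}
  &= \sum_{e_1} \mu(e_1 \mid N_1) \cdot \bigl( u(e_1) + V^{\mathrm{aev},\pi_1}_{\mu,2}(N_1 e_1) \bigr) \\
  &= \mu(S \mid N_1) \cdot \bigl( u(S) + V^{\mathrm{aev},\pi_1}_{\mu,2}(N_1 S) \bigr)
   + \mu(K \mid N_1) \cdot \bigl( u(K) + V^{\mathrm{aev},\pi_1}_{\mu,2}(N_1 K) \bigr) \\
  &= \tfrac{2}{5} \cdot (-10 + 0) + \tfrac{3}{5} \cdot \bigl( 0 - \tfrac{10}{21} \bigr) = -\tfrac{30}{7} \approx -4.3 \\
V^{\mathrm{pev},\pi_1}_{\mu,2}
  &= \sum_{e_1} \mu(e_1 \mid N_1 N_2) \cdot \bigl( u(e_1) + V^{\mathrm{pev},\pi_1}_{\mu,2}(N_1 e_1) \bigr) \\
  &= \mu(S \mid N_1 N_2) \cdot \bigl( u(S) + V^{\mathrm{pev},\pi_1}_{\mu,2}(N_1 S) \bigr)
   + \mu(K \mid N_1 N_2) \cdot \bigl( u(K) + V^{\mathrm{pev},\pi_1}_{\mu,2}(N_1 K) \bigr) \\
  &= \tfrac{10}{31} \cdot (-10 + 0) + \tfrac{21}{31} \cdot \bigl( 0 - \tfrac{10}{21} \bigr) = -\tfrac{110}{31} \approx -3.5
\end{aligned}
\]

Meanwhile, the value of $\pi_2$ is

\[
\begin{aligned}
V^{\mathrm{aev},\pi_2}_{\mu,2} = V^{\mathrm{pev},\pi_2}_{\mu,2}
  &= \sum_{e_1} \mu(e_1 \mid Y_1) \bigl( u(e_1) + V^{\mathrm{pev},\pi_2}_{\mu,2}(Y_1 e_1) \bigr) \\
  &= \mu(C \mid Y_1) \bigl( u(C) + V^{\mathrm{pev},\pi_2}_{\mu,2}(Y_1 C) \bigr) = 1 \cdot (-4 + 0) = -4
\end{aligned}
\]

That is, $V^{\mathrm{pev},\pi_1}_{\mu,2} > V^{\mathrm{pev},\pi_2}_{\mu,2} = V^{\mathrm{aev},\pi_2}_{\mu,2} > V^{\mathrm{aev},\pi_1}_{\mu,2}$.
So SPEDT but not SAEDT prefers $\pi_1$ to $\pi_2$. In other words, an SAEDT
agent considers himself sufficiently likely to have the parasite to adopt
policy $\pi_2$ of seeing the doctor. The SPEDT agent relies on the fact that he
would not pet the cat in case he saw it, and takes that as evidence of not
being sick. Hence he will instead adopt policy $\pi_1$ of not seeing the doctor.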
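
The fractions in this example are easy to get wrong by hand, so the following Python sketch recomputes them exactly. It is an illustration, not the authors' code. Two model parameters are not stated explicitly in the text and are inferred here from the joint probabilities used above: $\mu(N_1 \mid s) = 1/2$ in both states, and $\mu(N_2 \mid T, N_1 S) = 1/2$. All names are hypothetical.

```python
from fractions import Fraction as Fr

PRIOR = {"T": Fr(1, 2), "H": Fr(1, 2)}
P_N1 = Fr(1, 2)                                      # mu(N1 | s), inferred
P_KITTEN = {"T": Fr(1, 5), "H": Fr(1)}               # mu(K | s, N1)
P_N2 = {("T", "K"): Fr(1, 5), ("H", "K"): Fr(4, 5),  # mu(N2 | s, N1 K)
        ("T", "S"): Fr(1, 2), ("H", "S"): Fr(1, 2)}  # mu(N2 | s, N1 S), inferred
U1 = {"K": 0, "S": -10}                              # utility of the first percept
U2 = {("T", "K"): -10, ("H", "K"): 0,                # second percept after N2
      ("T", "S"): 0, ("H", "S"): 0}


def p_e1(e1, s):
    """mu(e1 | s, N1): kitten or, failing that, the percept S."""
    return P_KITTEN[s] if e1 == "K" else 1 - P_KITTEN[s]


def history(s, e1):
    """mu(s, N1 e1): used by SAEDT, which conditions on the next action only."""
    return PRIOR[s] * P_N1 * p_e1(e1, s)


def policy(s, e1):
    """mu(s, N1 e1 N2): used by SPEDT, which conditions on the whole policy."""
    return history(s, e1) * P_N2[s, e1]


def second_step(e1):
    """V(N1 e1): identical for both theories."""
    total = sum(policy(s, e1) for s in PRIOR)
    return sum(policy(s, e1) * U2[s, e1] for s in PRIOR) / total


def first_step(weight):
    """Expected utility of pi_1 when e1 is weighted by `weight`, normalized."""
    norm = sum(weight(s, e) for s in PRIOR for e in U1)
    return sum(weight(s, e) / norm * (U1[e] + second_step(e))
               for s in PRIOR for e in U1)


v_aev = first_step(history)   # action-evidential value of pi_1 = N1 N2
v_pev = first_step(policy)    # policy-evidential value of pi_1 = N1 N2
v_doctor = Fr(-4)             # value of pi_2 = Y1 under both theories
print(v_aev, v_pev, v_doctor)
```

Under these assumptions the sketch returns $-30/7$ and $-110/31$ for the two values of $\pi_1$, which bracket the value $-4$ of $\pi_2$ as claimed.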
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